Noisy Text Clustering

Authors

  • David Grangier
  • Alessandro Vinciarelli
Abstract

This work presents document clustering experiments performed over noisy texts (i.e., texts that have been extracted through an automatic process such as speech or character recognition). The effect of recognition errors on different clustering techniques is measured by comparing the results obtained with clean (manually typed) and noisy (automatic speech transcripts affected by a 30% Word Error Rate) versions of the TDT2 corpus (∼600 hours of spoken data from broadcast news). The results suggest that clustering can be performed over noisy data with an acceptable performance degradation.

Related Resources

CONFIRM - Clustering Of Noisy Form Images using Robust Metrics

The ability to automatically cluster large collections of noisy form images according to form type would improve the efficiency of organizations that currently do this by hand. Some noisy form collections contain form types that are structurally very similar, but should cluster apart. To address this issue, we propose CONFIRM (Clustering Of Noisy Form Images using Robust Metrics). CONFIRM uses a ...

Effect of Recognition Errors on Text Clustering

This paper presents clustering experiments performed over noisy texts (i.e. texts that have been extracted through an automatic process like character or speech recognition). The effect of recognition errors is investigated by comparing clustering results performed over both clean (manually typed data) and noisy (automatic speech transcriptions) versions of the same speech recording corpus.

Clustering with Side Information for Mining Text Data

Side information is available along with text documents in several text mining applications. This side information can be of different kinds, such as document provenance information, links in the document, other non-textual attributes contained in the document, or user access behavior from web logs. Some attributes may contain an extremely large amount of information for clustering purp...

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

Noise Clustering Approach to Speaker Verification

In a speaker verification system, a claimed speaker's score is computed to accept or reject the speaker claim. Most of the current methods compute the score as the ratio of the claimed speaker's and the impostors' likelihood functions. Based on analysing false acceptance error obtained by using these methods, we propose a noise clustering approach to find better scores which can reduce that error....

Publication date: 2004